Statistics in Model

Metadata
- tags: #statistics
For a project predicting the likelihood of collisions between trucks and other road users, the topic of confidence intervals came up
- The [[bootstrap-algorithm]] technique can be used to quantify the level of confidence around regression coefficients and predictions
- Discussion with managers about how to implement this
  - Sample a subset of the training data (~80%)
  - Normalize the input data (so that the coefficients of a linear regression can be compared and interpreted)
  - Fit the model on the sampled data and record the estimated coefficients
  - Repeat n times
  - Note the range of variability of each estimated coefficient and use it to make variable selections
  - Apply the model, using the variability of the estimated coefficients as the sample space to draw from
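
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the data is synthetic, the model is plain least squares via NumPy, and the names (`n_rounds`, `n_draw`, `keep`, `x_new`) are hypothetical. It also assumes rows are resampled with replacement and that inputs are normalized once using full-data statistics, which the notes do not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 3 features, the second has no real effect.
n_rows = 400
X = rng.normal(size=(n_rows, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n_rows)

# Normalize inputs so coefficient magnitudes are comparable across features.
mu, sigma = X.mean(axis=0), X.std(axis=0)
design = np.column_stack([np.ones(n_rows), (X - mu) / sigma])

# Assumption: each round draws ~80% of the rows, with replacement.
n_rounds, n_draw = 500, int(0.8 * n_rows)
coefs = np.empty((n_rounds, design.shape[1]))
for i in range(n_rounds):
    idx = rng.integers(0, n_rows, size=n_draw)
    beta, *_ = np.linalg.lstsq(design[idx], y[idx], rcond=None)
    coefs[i] = beta  # intercept first, then one coefficient per feature

# 95% percentile interval for each coefficient.
lower, upper = np.percentile(coefs, [2.5, 97.5], axis=0)

# Variable selection heuristic: keep features whose interval excludes zero.
keep = (lower[1:] > 0) | (upper[1:] < 0)

# Prediction for a new row: one prediction per recorded coefficient vector.
x_new = np.array([0.5, -0.2, 1.0])
preds = coefs @ np.concatenate([[1.0], (x_new - mu) / sigma])
pred_lower, pred_upper = np.percentile(preds, [2.5, 97.5])
```

Note that the interval around `preds` reflects only the uncertainty in the coefficients, not the residual noise in individual observations.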
Designing a statistical test
- For a given scenario, we would like to test whether some intervention has caused an outcome to change
  - Null hypothesis: no observable change
  - Alternative hypothesis: observable change
- There are four properties we care about: power, significance, sample size, effect size
  - These are interrelated: once any three are fixed, we can solve for the fourth
- Power ( $1 - \beta$ )
  - Is the probability of rejecting the null hypothesis when it is actually false, i.e. of detecting a real effect
  - $\beta$ is the probability of a False Negative (Type II error)
- Significance level ( $\alpha$ )
  - Is the probability of rejecting the null hypothesis when it is actually true (False Positive or Type I error)
  - This is usually set at 0.05 or lower, giving a confidence level ( $1-\alpha$ ) of 95% or higher
- Sample size ( $n$ )
  - As sample size increases, the power increases even if the significance level is held constant, because the variance of the estimate (e.g. the sample mean) becomes smaller
- Effect size ( $e$ )
  - The separation between the means of the two distributions
  - If the effect is small, then to increase the power one has to sample more or relax the significance level
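
To make the relationship between the four properties concrete, here is a rough sketch that solves for each one given the other three. It assumes a specific setup the notes do not state: a two-sided, two-sample z-test with equal group sizes, where `effect` is the difference in means divided by the common standard deviation. The function names are hypothetical, and the formulas use the usual normal approximation that ignores the negligible opposite tail.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power(effect, n, alpha=0.05):
    """Approximate power with n observations per group."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(effect * sqrt(n / 2) - z_crit)


def sample_size(effect, target_power=0.8, alpha=0.05):
    """Smallest n per group that reaches the target power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return ceil(2 * ((z_crit + Z.inv_cdf(target_power)) / effect) ** 2)


def detectable_effect(n, target_power=0.8, alpha=0.05):
    """Smallest standardized effect detectable at the target power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return (z_crit + Z.inv_cdf(target_power)) * sqrt(2 / n)


def significance(effect, n, target_power=0.8):
    """Significance level needed to reach the target power."""
    return 2 * (1 - Z.cdf(effect * sqrt(n / 2) - Z.inv_cdf(target_power)))


print(sample_size(0.5))          # 63 per group
print(power(0.2, 100))           # small effect, modest sample: low power
print(power(0.2, 400))           # same effect, larger sample: higher power
print(power(0.2, 100, alpha=0.10))  # same sample, relaxed significance: higher power
```

The last three calls illustrate the trade-off for a small effect: power rises either by sampling more or by relaxing the significance level.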